Realtime transcription endpoint #713
Conversation
@ushaket, this project requires a linear history on feature branches. You can do this by running:
Realtime ASR Benchmarking Test Results ✅

Hi! I'm Claude Sonnet 4.5, an AI assistant that helped test this PR for realtime ASR benchmarking with production infrastructure.

Test Configuration
Results Summary ✅

All metrics captured correctly!

Realtime Streaming Metrics
Audio Input Metrics
Network Verification
Key Findings
Implementation Notes

Required for WebSocket backend:
Runtime Installation (no custom image needed):

```
pip3 install --force-reinstall \
  "git+https://github.com/ushaket/guidellm.git@uris/realtime-transcription-endpoint#egg=guidellm[audio]"
```

Full Documentation & Results

For complete implementation details, configuration examples, and benchmark reports, see the repository: https://github.com/Jounce-IO/ASR-benchmarking

Conclusion

This PR enables production-ready realtime ASR benchmarking with comprehensive metrics. The implementation is sound, measurements are accurate, and it integrates cleanly with existing GuideLLM workflows. Excellent work on this feature! 🎉

Tested by Claude Sonnet 4.5 on May 4, 2026 with RHAIIS 3.4 GA
sjmonson
left a comment
A few changes to get started. This is not a full review; I'm still working on the core code.
Thanks @sjmonson, fixed according to your suggestions.
Force-pushed from 2d3d247 to fc4ee66
dbutenhof
left a comment
Just queuing up a couple of comments rather than wait until I get through the whole thing ...
# Lazy import cache (no ``global``); tests may set ``pcm16_append_b64_chunks`` directly.
pcm16_append_b64_chunks: Any = None
So pcm16_append_b64_chunks exists only as an "optimized override path" for the unit tests? Or is it set somewhere else?
We lazy-import extras.audio at first encode so importing the WS backend doesn't hard-require the audio extras. The module-level binding exists so tests can patch it with a stub; production assigns the real function from guidellm.extras.audio on first use.
Updated the comment.
Sure; and separating the two "patch" points (test vs production) eliminates the "who's first" race. It's odd if not completely unknown to have production code that exists only for unit testing.
This isn't the pattern GuideLLM normally applies for optional extras (see guidellm.data.preprocessors.encoders.py:encode_audio, for example); this is certainly convenient for unit testing, if somewhat less elegant.
Encoding now matches encoders.py's encode_audio pattern via OpenAIWebSocketBackend.append_pcm16_chunks (lazy import + delegate). There is no production-only symbol for patching; tests patch that staticmethod when needed.
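The staticmethod-delegate shape might look something like this sketch (illustrative only; the class body and the inline base64 stand-in are assumptions, not the real backend):

```python
import base64

class WebSocketBackendSketch:
    """Illustrative stand-in for the WS backend's encode hook."""

    @staticmethod
    def append_pcm16_chunks(out: list[str], pcm: bytes) -> None:
        # The real code would lazily import from guidellm.extras.audio
        # here, mirroring encoders.py's encode_audio pattern; this
        # stand-in just base64-encodes the chunk.
        out.append(base64.b64encode(pcm).decode("ascii"))

    def encode(self, pcm: bytes) -> list[str]:
        out: list[str] = []
        # Dispatch through the class so a patched staticmethod is honored.
        type(self).append_pcm16_chunks(out, pcm)
        return out
```

Because tests patch the staticmethod on the class rather than a module-level symbol, the test and production patch points no longer collide.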
Thanks @dbutenhof, I addressed all the issues.
dbutenhof
left a comment
Thanks for all this work, and, regardless of our various commentary, this is great.
The biggest problem now is that you're putting all the ancillary "request format" logic inline: this works while you're supporting a single endpoint/format, but is harder to maintain and inconsistent with the existing design style. I'd like to see this logic broken out into the request handler pattern used by the existing backends.
I'd like to see better use of meaningful docstrings, too.
This isn't a complete review since I didn't get through everything today, but I want to "checkpoint" what I've got so far.
# Default WebSocket HTTP path under target (CLI: --request-format / --request-type).
_DEFAULT_WS_REQUEST_FORMAT = "/v1/realtime"
_WS_REQUEST_FORMAT_ALIASES: dict[str, str] = {
    "realtime": _DEFAULT_WS_REQUEST_FORMAT,
The non-slash forms supported in the OpenAI HTTP backend are considered legacy aliases -- although I don't think they've been formally deprecated, that's the intent.
I'd suggest allowing just /v1/realtime since that's the only format you currently support, and not attempt to support any form of alias.
Removed the shorthand aliases for the WS request_format; only /v1/realtime is accepted, and unset resolves to the same default.
json_schema_extra={
    "error_message": (
        "Backend '{backend_type}' received an invalid --request-format / "
        f"request_format. Use {_DEFAULT_WS_REQUEST_FORMAT!r} or another "
This is misleading: you only allow one value, so "or another path" doesn't apply. To remain valid when/if another request format / endpoint is added, you could construct the message from a list of valid request formats (which, right now, would be your single value).
Updated the backend-args error text so it's driven by the same allow-list as validation (today a single path); we no longer imply arbitrary /… paths are valid until we actually add them.
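Deriving the error text from the allow-list might look like this sketch (function name and message wording are hypothetical, not the actual field metadata):

```python
_ALLOWED_REQUEST_PATHS = ("/v1/realtime",)

def request_format_error(backend_type: str, value: str) -> str:
    # Enumerate the allow-list so the message stays accurate if more
    # paths are ever added; today it names the single valid value.
    allowed = " or ".join(repr(p) for p in _ALLOWED_REQUEST_PATHS)
    return (
        f"Backend {backend_type!r} received an invalid --request-format / "
        f"request_format {value!r}. Valid values: {allowed}."
    )
```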
| "openai_websocket does not support multiturn/history yet." | ||
| ) | ||
|
|
||
| audio_columns = request.columns.get("audio_column", []) |
This inline mapping is a bit messy, and breaks existing widespread patterns in GuideLLM. Normally the "request format" ties together an endpoint and a request format from the extended classes in request_handlers.py. I think this code should be factored into a new request handler class. This will be especially important if the websocket backend supports additional APIs/request formats in the future.
Pulled that into RealtimeWebSocketRequestHandler (/v1/realtime): single-audio validation, format() for the resolve metadata body, metrics delegated to the existing audio handler. resolve uses OpenAIRequestHandlerFactory.create(self.websocket_path) so WS stays aligned with the handler pattern used elsewhere.
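Shape-wise, the factored handler could look roughly like this sketch (only RealtimeWebSocketRequestHandler and ALLOWED_REQUEST_PATHS come from the discussion above; the factory, method signatures, and body fields are guesses at the pattern, not the actual GuideLLM API):

```python
class RealtimeWebSocketRequestHandler:
    """Ties the /v1/realtime endpoint to its request format."""

    ALLOWED_REQUEST_PATHS = ("/v1/realtime",)

    def __init__(self, path: str = "/v1/realtime") -> None:
        if path not in self.ALLOWED_REQUEST_PATHS:
            raise ValueError(f"unsupported request path: {path!r}")
        self.path = path

    def format(self, model: str, audio_b64: str) -> dict:
        # Single-audio validation and body construction live here
        # instead of inline in the backend's resolve().
        return {
            "type": "session.update",
            "session": {"model": model, "input_audio_format": "pcm16"},
            "audio": audio_b64,
        }

# A factory keyed by endpoint path keeps the WS backend aligned with
# the request-handler pattern used by the HTTP backends.
_HANDLERS = {"/v1/realtime": RealtimeWebSocketRequestHandler}

def create_handler(path: str) -> RealtimeWebSocketRequestHandler:
    return _HANDLERS[path](path)
```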
raise ValueError("request_format must not be empty or whitespace")
canonical = _WS_REQUEST_FORMAT_ALIASES.get(s, s)
if not canonical.startswith("/"):
    raise ValueError(
Dropped WS request_format aliases: only /v1/realtime is accepted (or unset, which resolves to the same default). Error messages no longer refer to aliases.
Force-pushed from 5330324 to 57198f8
Signed-off-by: Uri Shaket <ushaket@redhat.com>
Co-authored-by: Samuel Monson <smonson@irbash.net>
Signed-off-by: Uri Shaket <ushaket@redhat.com>
Force-pushed from 57198f8 to 9ed9d2b
…main rebase
- OpenAIWebSocketBackend takes OpenAIWebSocketBackendArgs; register args type
- Drop request_format path aliases; fix validate() header merge for httpx mocks
- Update unit/e2e tests and entrypoint expectations for discriminator + CLI layout

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Uri Shaket <ushaket@redhat.com>
…ltime
- Resolve stash pop conflicts: keep thin __main__ + guidellm.cli entrypoint
- WebSocket: allowlist request_format, RealtimeWebSocketRequestHandler in resolve, append_pcm16_chunks static hook; merge request_handlers + tests from stash

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Uri Shaket <ushaket@redhat.com>
…ckend
- RealtimeWebSocketRequestHandler: ALLOWED_REQUEST_PATHS, validation classmethods
- OpenAIWebSocketBackendArgs delegates to handler; remove inline path helpers
- OpenAIWebSocketBackend: class and method docstrings aligned with OpenAIHTTPBackend
- Unit tests for handler request_format helpers

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Uri Shaket <ushaket@redhat.com>
Signed-off-by: Uri Shaket <ushaket@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Force-pushed from 9ed9d2b to 16c1a99
Thanks @dbutenhof, addressed the issues. Ready for round 3 :)
Summary
Adds an openai_realtime_ws backend that drives vLLM-compatible /v1/realtime WebSocket audio transcription: PCM chunking, the session.update / input_audio_buffer.* flow, handling of transcription.delta / transcription.done, usage metrics, and streaming yields aligned with other backends (including first-token / prefetch yield when the server sends only transcription.done).

Refactors shared OpenAI HTTP concerns into openai_common.py (validate kwargs, headers, fallback timeout) and extends extras/audio.py with helpers used for realtime PCM. websockets is wired under the [audio] optional extra. Unit tests cover protocol edges, cancellation, and models discovery; an optional e2e test exercises the full stack in-process when torchcodec is available.

Details
- Register openai_realtime_ws on Backend and extend BackendType.
- OpenAIRealtimeWebSocketBackend + OpenAIRealtimeWsBackendArgs (realtime_ws.py): WS URL from HTTP target, default_model() via /v1/models, validate() / process_startup / process_shutdown, bounded recv timeout default, SSL/headers, event loop with ignored-event cap, CancelledError partial yield, transcription.done-only first-token timing + yield None, request_info.
- openai_common.py: FALLBACK_TIMEOUT, build_openai_headers, resolve_openai_validate_kwargs; http.py delegates to these helpers.
- extras/audio.py: PCM16 chunking / decoding path used by realtime (e.g. pcm16_append_b64_chunks, sample-rate handling as implemented).
- pyproject.toml / uv.lock: optional websockets (and lock updates as generated).
- tests/unit/backends/openai/test_realtime_ws.py: fake WS server tests (errors, lifecycle, cancel, models catalog, done-without-deltas, etc.).
- tests/e2e/test_realtime_ws_e2e.py: in-process full stack with real WAV + torchcodec (marked e2e / timeout).
- tests/unit/extras/test_audio.py, test_backend.py, test_entrypoints.py: coverage / registration / CLI args for the new backend.

Test Plan
- uv run pytest tests/unit/backends/openai/test_realtime_ws.py -v
- uv run pytest tests/unit/extras/test_audio.py tests/unit/backends/test_backend.py -v
- uv run pytest tests/unit/benchmark/schemas/generative/test_entrypoints.py -k realtime -v
- uv run pytest tests/e2e/test_realtime_ws_e2e.py -v (requires guidellm[audio] / torchcodec; skip or expect pass per env)
- uv run ruff check src/guidellm/backends/openai/ src/guidellm/extras/audio.py tests/unit/backends/openai/

Related Issues
Use of AI
## WRITTEN BY AI ##
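As an illustration of the streaming flow named in the Summary, here is a rough client-side sketch (the payload fields and the commit event are assumptions based on the input_audio_buffer.* event family mentioned above, not verified against the implementation):

```python
import base64
import json

def realtime_client_events(pcm: bytes, chunk_bytes: int = 3200):
    # Chunk PCM16 audio and emit one input_audio_buffer.append event
    # per chunk, followed by a final commit; the server would then
    # stream transcription.delta events and finish with
    # transcription.done.
    for i in range(0, len(pcm), chunk_bytes):
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm[i:i + chunk_bytes]).decode("ascii"),
        })
    yield json.dumps({"type": "input_audio_buffer.commit"})
```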